Dimensionality Reduction

Dimensions as a matrix

Consider a measurement of the density of two proteins in three cells:

            Cell 1   Cell 2   Cell 3
Protein 1   1.2      3.1      0.5
Protein 2   2.8      0.9      3.5

One way to represent this is to plot Protein 1 against Protein 2, with Cells 1, 2, and 3 as points on that plot.

using GLMakie

# Densities from the table above, one entry per cell
protein1 = [1.2, 3.1, 0.5]
protein2 = [2.8, 0.9, 3.5]
cell_labels = ["Cell 1", "Cell 2", "Cell 3"]

fig = Figure(size = (600, 400))
ax = Axis(fig[1, 1],
          xlabel="Protein 1 Density",
          ylabel="Protein 2 Density")

# One point per cell, with its label just above and to the right of the marker
scatter!(ax, protein1, protein2, markersize = 10)
text!(ax, protein1, protein2, text = cell_labels, align = (:left, :bottom), offset = (5, 5), fontsize=12)

fig
Figure 1: Protein 1 vs Protein 2 Density

With many cells, we can see that some cells have more similar protein densities than others. We can then sort those cells into populations, give the populations names, and reason about the cells through those categories. Humans love to categorize things and then argue about the categories.

using GLMakie
using Random

Random.seed!(42)  # fix the seed so the figure is reproducible

n_points_per_cluster = 50

# Cluster 1: "Red Team"
cluster1_x = randn(n_points_per_cluster) .+ 2.5
cluster1_y = randn(n_points_per_cluster) .+ 2.5

# Cluster 2: "Blue Team"
cluster2_x = randn(n_points_per_cluster) .- 2.5
cluster2_y = randn(n_points_per_cluster) .- 2.5

fig = Figure(size = (700, 500))
ax = Axis(fig[1, 1],
          xlabel="Protein X Density",
          ylabel="Protein Y Density",
          title="Cell Populations")
scatter!(ax, cluster1_x, cluster1_y, color = (:red, 0.7), label = "Red Team", markersize=5)
scatter!(ax, cluster2_x, cluster2_y, color = (:blue, 0.7), label = "Blue Team", markersize=5)
axislegend(ax, position = :rt) # position: :rt (right top), :ct (center top), etc.

fig
Figure 2: Cell Populations based on Protein Density (GLMakie)
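
To make "similar" concrete, here is a minimal sketch that treats each cell from the first table as a point and measures how far apart the points are. Euclidean distance is an assumption made for illustration; it is only one of several reasonable ways to compare cells, and the variable names are hypothetical.

```julia
using LinearAlgebra  # standard library; provides norm

# Each cell is a point: (Protein 1 density, Protein 2 density)
cell1 = [1.2, 2.8]
cell2 = [3.1, 0.9]
cell3 = [0.5, 3.5]

# Euclidean distance between two cells: smaller means more similar
d12 = norm(cell1 - cell2)
d13 = norm(cell1 - cell3)
d23 = norm(cell2 - cell3)

println("Cell 1 vs Cell 2: ", round(d12, digits = 2))
println("Cell 1 vs Cell 3: ", round(d13, digits = 2))
println("Cell 2 vs Cell 3: ", round(d23, digits = 2))
```

Under this measure, Cells 1 and 3 are the closest pair, which matches what Figure 1 shows: they sit near each other, while Cell 2 sits apart.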

More cells mean more points. But another protein is a whole new dimension.

So a matrix \(M_{ij}\) can be thought of as a set of data points in a space: the row index \(i\) runs over the dimensions (proteins) and the column index \(j\) runs over the data points (cells). There are other ways to think about this. A higher-rank tensor \(M_{ijk}\) is a whole different thing, and takes a lot more work to think about.
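
As a small sketch of this picture, the table from the start of the section can be written as a Julia matrix, with proteins as rows and cells as columns. The values for "Cell 4" and "Protein 3" below are made up for illustration and do not come from any measurement.

```julia
# Rows are proteins (dimensions), columns are cells (data points)
M = [1.2 3.1 0.5;
     2.8 0.9 3.5]

n_dims, n_points = size(M)

# One more cell appends a column: one more point, same number of dimensions.
# The values for this hypothetical "Cell 4" are made up.
cell4 = [2.0, 1.5]
M_more_cells = hcat(M, cell4)

# One more protein appends a row: same points, one more dimension.
# The values for this hypothetical "Protein 3" are made up.
protein3 = [0.7 1.1 2.4]
M_more_proteins = vcat(M, protein3)

println(size(M))                # (2, 3)
println(size(M_more_cells))     # (2, 4)
println(size(M_more_proteins))  # (3, 3)
```

Here `M[2, 1]` is the density of Protein 2 in Cell 1. Adding a column leaves the plot two-dimensional with one extra point, while adding a row means every existing point needs a third axis.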

A common thing a biologist wants to do is look at millions of cells across all of their genes. The human genome has roughly 20,000 protein-coding genes, so each cell becomes a point in a roughly 20,000-dimensional space. There is no good way to plot that, so we want to bring the number of dimensions down.

Principal Component Analysis